Training with Cloud Machine Learning Engine

This notebook is the second in a series of steps for running machine learning on the cloud. In this step, we will use the data and associated analysis metadata prepared in the previous notebook to train a model.

Workspace Setup

The first step is to set up the workspace that we will use within this notebook: the Python libraries, and the Google Cloud Storage bucket that will hold the inputs and outputs produced over the course of these steps.


In [8]:
import google.datalab as datalab
import google.datalab.ml as ml
import mltoolbox.regression.dnn as regression
import os
import time

The storage bucket was created in the previous notebook. We'll re-declare it here, so we can use it.


In [3]:
storage_bucket = 'gs://' + datalab.Context.default().project_id + '-datalab-workspace/'
storage_region = 'us-central1'

workspace_path = os.path.join(storage_bucket, 'census')

Data and DataSets

We'll also enumerate our data and declare DataSets for use during training.


In [4]:
!gsutil ls -r {workspace_path}/data


gs://cloud-ml-users-datalab-workspace/census/data/:
gs://cloud-ml-users-datalab-workspace/census/data/eval.csv
gs://cloud-ml-users-datalab-workspace/census/data/schema.json
gs://cloud-ml-users-datalab-workspace/census/data/train.csv

In [5]:
train_data_path = os.path.join(workspace_path, 'data/train.csv')
eval_data_path = os.path.join(workspace_path, 'data/eval.csv')
schema_path = os.path.join(workspace_path, 'data/schema.json')

train_data = ml.CsvDataSet(file_pattern=train_data_path, schema_file=schema_path)
eval_data = ml.CsvDataSet(file_pattern=eval_data_path, schema_file=schema_path)

Data Analysis

We previously analyzed the training data to produce statistics and vocabularies. These will be used during training.


In [6]:
analysis_path = os.path.join(workspace_path, 'analysis')

In [7]:
!gsutil ls {analysis_path}


gs://cloud-ml-users-datalab-workspace/census/analysis/schema.json
gs://cloud-ml-users-datalab-workspace/census/analysis/stats.json
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_AGEP.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_COW.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_ESP.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_ESR.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_FOD1P.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_HINS4.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_INDP.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_JWMNP.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_JWTR.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_MAR.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_POWPUMA.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_PUMA.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_RAC1P.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_SCHL.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_SCIENGRLP.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_SERIALNO.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_SEX.csv
gs://cloud-ml-users-datalab-workspace/census/analysis/vocab_WKW.csv

Training

Training in the cloud is accomplished by submitting jobs to Cloud Machine Learning Engine. When submitting jobs, it is a good idea to name each job so it can be looked up easily (names must be unique within the scope of a project).

Additionally, you'll want to pick a region where your job will run. Usually this is the same region where your training data resides.

Finally, you'll want to pick a scale tier. The documentation describes the different scale tiers and custom cluster setups you can use with ML Engine. For the purposes of this sample, a simple single-node cluster suffices.


In [20]:
config = ml.CloudTrainingConfig(region=storage_region, scale_tier='BASIC')

training_job_name = 'census_regression_' + str(int(time.time()))
training_path = os.path.join(workspace_path, 'training')
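
In addition to the BASIC tier used above, a custom cluster can be described by filling in the remaining fields of CloudTrainingConfig. The cell below is illustrative only and is not used in this sample; the field names and machine type values are assumptions, so check the scale tier documentation and your version of the library for the supported options.


In [ ]:
# Illustrative only: a custom cluster configuration (not used in this sample).
# The field names (master_type, worker_type, worker_count) and the machine type
# values are assumptions; consult the CloudTrainingConfig and scale tier docs.
custom_config = ml.CloudTrainingConfig(region=storage_region,
                                       scale_tier='CUSTOM',
                                       master_type='complex_model_m',
                                       worker_type='standard',
                                       worker_count=2)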

In [21]:
features = {
  "WAGP": {"transform": "target"},
  "SERIALNO": {"transform": "key"},
  "AGEP": {"transform": "embedding", "embedding_dim": 2},  # Age
  "COW": {"transform": "one_hot"},                         # Class of worker
  "ESP": {"transform": "embedding", "embedding_dim": 2},   # Employment status of parents
  "ESR": {"transform": "one_hot"},                         # Employment status
  "FOD1P": {"transform": "embedding", "embedding_dim": 3}, # Field of degree
  "HINS4": {"transform": "one_hot"},                       # Medicaid
  "INDP": {"transform": "embedding", "embedding_dim": 5},  # Industry
  "JWMNP": {"transform": "embedding", "embedding_dim": 2}, # Travel time to work
  "JWTR": {"transform": "one_hot"},                        # Transportation
  "MAR": {"transform": "one_hot"},                         # Marital status
  "POWPUMA": {"transform": "one_hot"},                     # Place of work
  "PUMA": {"transform": "one_hot"},                        # Area code
  "RAC1P": {"transform": "one_hot"},                       # Race
  "SCHL": {"transform": "one_hot"},                        # School
  "SCIENGRLP": {"transform": "one_hot"},                   # Science
  "SEX": {"transform": "one_hot"},
  "WKW": {"transform": "one_hot"}                          # Weeks worked
}

NOTE: To facilitate re-running this notebook, any previous training outputs are first deleted, if they exist.


In [ ]:
!gsutil rm -rf {training_path}

NOTE: The job submitted below can take a few minutes to complete. Once you have submitted it, you can continue with the remaining steps in the notebook until the call to job.wait().


In [22]:
job = regression.train_async(train_dataset=train_data, eval_dataset=eval_data,
                             features=features,
                             analysis_dir=analysis_path,
                             output_dir=training_path,
                             max_steps=2000,
                             layer_sizes=[5, 5, 5],
                             job_name=training_job_name,
                             cloud=config)


Building package and uploading to gs://cloud-ml-users-datalab-workspace/census/training/staging/trainer.tar.gz
Job request send. View status of job at
https://console.developers.google.com/ml/jobs?project=cloud-ml-users

When a job is submitted to ML Engine, a few things happen. The code for the job is staged in Google Cloud Storage, and a job definition is submitted to the service.

The service queues the job; thereafter it can be monitored in the console (status and logs), as well as with TensorBoard. The service also provisions compute resources based on the chosen scale tier, installs your code package and its dependencies, and starts your training process. It then monitors the job for completion, retrying if necessary.

The first step in the process - launching a training cluster - can take a few minutes. It is recommended to first validate jobs on the cloud using the BASIC tier, which benefits from quicker job starts and allows faster iteration, and then launch larger-scale jobs where the overhead of launching a cluster is small relative to the lifetime of the job itself.

You can check the progress of the job using the link to the console page above, as well as its logs.
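
If you prefer to check on the job from within the notebook, the gcloud command line can describe it. This is a sketch that assumes the gcloud SDK available in this environment includes the ml-engine command group.


In [ ]:
# Sketch: assumes the gcloud SDK in this environment includes the ml-engine commands.
# 'describe' prints the job's current state; 'gcloud ml-engine jobs stream-logs'
# can follow the logs instead, but it blocks until the job finishes.
!gcloud ml-engine jobs describe {training_job_name}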

TensorBoard

TensorBoard can be launched against your training output directory. As summaries are produced by your running job, they will show up in TensorBoard.


In [ ]:
tensorboard_pid = ml.TensorBoard.start(training_path)
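
To see which TensorBoard instances are currently running (and on which ports), the library also provides a listing helper. This assumes TensorBoard.list() is available in your version of google.datalab.ml.


In [ ]:
# Assumption: TensorBoard.list() exists in this version of google.datalab.ml;
# it lists running TensorBoard instances with their log directories and ports.
ml.TensorBoard.list()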

The Trained Model

Once training completes, the resulting trained model is saved to Cloud Storage.


In [23]:
# Wait for the job to be complete before proceeding.
job.wait()


Out[23]:
Job census_regression_1488915059 completed

In [18]:
!gsutil ls -r {training_path}/model


gs://cloud-ml-users-datalab-workspace/census/training/model/:
gs://cloud-ml-users-datalab-workspace/census/training/model/
gs://cloud-ml-users-datalab-workspace/census/training/model/saved_model.pb

gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/:
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/features.json
gs://cloud-ml-users-datalab-workspace/census/training/model/assets.extra/schema.json

gs://cloud-ml-users-datalab-workspace/census/training/model/variables/:
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/variables.data-00000-of-00001
gs://cloud-ml-users-datalab-workspace/census/training/model/variables/variables.index
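
The output is a standard TensorFlow SavedModel. As an optional check, you can copy it to a local scratch directory and inspect it with the saved_model_cli tool that ships with TensorFlow; the local path below is just an example, and this assumes saved_model_cli is available in this environment.


In [ ]:
# Optional: copy the SavedModel to an example local path and inspect it with
# TensorFlow's saved_model_cli tool (assumed to be installed in this environment).
!mkdir -p /tmp/census_model
!gsutil -m cp -r {training_path}/model/* /tmp/census_model/
!saved_model_cli show --dir /tmp/census_model --all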

Cleanup


In [16]:
ml.TensorBoard.stop(tensorboard_pid)

Next Steps

Once a model has been created, the next step is to evaluate it, possibly against multiple evaluation datasets. We'll continue with this in the next notebook.